Most people dislike taking multiple-choice tests, so why are they the default way we evaluate NLP systems? This position paper argues that, despite its simplicity and popularity, multiple-choice evaluation is flawed, both in its format and the datasets it relies on. Drawing from educational testing theory, we propose practical fixes for these issues, helping us build evaluations that better test knowledge and reflect how humans use NLP systems.
Language models are optimized to learn which responses you prefer, but they don't learn why you preferred a particular response. This limits their ability to tailor responses to personalized requests (e.g., "What should I eat for dinner? I'm vegetarian"), so we introduce a simple fix: have models infer personas that explain why users could prefer responses. We show that training on these inferred personas leads to responses that are significantly more personalized to user needs.
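As a minimal, purely illustrative sketch of this idea: persona inference can be framed as annotating each preference example with a model-written explanation, then training on the persona-augmented prompts. The prompt wording, example format, and `generate` interface below are assumptions, not the paper's actual pipeline.

```python
# Hypothetical sketch of persona-augmented preference data. The prompt
# wording, example format, and `generate` callable are illustrative
# assumptions; they are not the paper's actual interface.

def infer_persona(generate, prompt, chosen, rejected):
    """Ask a language model why a user might prefer `chosen` over `rejected`."""
    query = (
        f"A user asked: {prompt}\n"
        f"They preferred this response: {chosen}\n"
        f"over this one: {rejected}\n"
        "In one sentence, describe a persona (needs, constraints, tastes) "
        "that would explain this preference."
    )
    return generate(query)

def augment_preferences(generate, preference_data):
    """Prepend each inferred persona to its prompt, so preference training
    sees (persona + prompt, chosen, rejected) instead of (prompt, ...)."""
    augmented = []
    for ex in preference_data:
        persona = infer_persona(generate, ex["prompt"], ex["chosen"], ex["rejected"])
        augmented.append({
            "prompt": f"User persona: {persona}\n\n{ex['prompt']}",
            "chosen": ex["chosen"],
            "rejected": ex["rejected"],
        })
    return augmented
```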
Language models like ChatGPT are pretty good at answering questions (e.g., "What is 12 * 12?"), but we show they can surprisingly struggle when asked to do the reverse task: generating questions for answers (e.g., "Give me a question with the answer 144"). We study when these errors happen, what might be causing them, and how they can be addressed.
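A rough way to probe this asymmetry is sketched below; the simplistic substring checks and the generic `generate` callable are stand-ins, not the paper's evaluation protocol.

```python
# Illustrative probe of the forward/reverse asymmetry. `generate` is any
# prompt -> text callable; the substring checks are crude placeholders,
# not the paper's evaluation protocol.

def forward_probe(generate, question, answer):
    """Forward task: can the model answer the question?"""
    return answer in generate(f"Answer concisely: {question}")

def reverse_probe(generate, answer):
    """Reverse task: can the model write a question whose answer is `answer`?
    Verified by having the model answer its own question."""
    question = generate(f"Write a question whose answer is exactly: {answer}")
    return answer in generate(f"Answer concisely: {question}")

# A model might pass forward_probe(g, "What is 12 * 12?", "144")
# while failing reverse_probe(g, "144").
```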
Learning vocabulary (e.g., benevolent) can be tedious, but using mnemonics (e.g., benevolent sounds like "benefits," and a kind boss gives benefits) makes it more engaging and effective. This paper introduces SMART, a large language model trained to produce mnemonics based on feedback from flashcard learners. Although students struggle to predict which mnemonics will help them most, training SMART on both student preferences and learning outcomes lets us generate mnemonics as effectively as GPT-4, but at a much lower cost.
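One hedged sketch of how learner feedback could become training data follows; the field names and the pairing rule are illustrative assumptions, since the summary above does not spell out how SMART combines preferences with learning outcomes.

```python
# Hypothetical sketch: turning flashcard outcomes into preference pairs
# for fine-tuning a mnemonic generator. Field names and the pairing rule
# are assumptions; SMART's actual training recipe is in the paper.

def build_preference_pairs(study_logs):
    """For each word, treat the mnemonic with the highest observed recall
    rate as 'chosen' and the one with the lowest as 'rejected'."""
    by_word = {}
    for log in study_logs:  # each log: {"word", "mnemonic", "recall_rate"}
        by_word.setdefault(log["word"], []).append(log)

    pairs = []
    for word, logs in by_word.items():
        logs.sort(key=lambda l: l["recall_rate"], reverse=True)
        if len(logs) >= 2 and logs[0]["recall_rate"] > logs[-1]["recall_rate"]:
            pairs.append({
                "prompt": f"Write a keyword mnemonic for: {word}",
                "chosen": logs[0]["mnemonic"],
                "rejected": logs[-1]["mnemonic"],
            })
    return pairs
```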
Dynamic topic models (DTMs) analyze text streams to capture the evolution of topics. Despite their popularity, existing DTMs are either fully supervised, requiring expensive human annotations, or fully unsupervised, generating topic evolutions that often do not cater to a user’s needs. Further, the topic evolutions produced by DTMs tend to contain generic terms that are not indicative of their designated time steps. To address these issues, we propose the task of discriminative dynamic topic discovery. This task aims to discover topic evolutions from temporal corpora that distinctly align with a set of user-provided category names and uniquely capture topics at each time step. We solve this task by developing DynaMiTE, a framework that ensembles semantic-similarity, category-indicative, and time-indicative scores to produce informative topic evolutions. Through experiments on three diverse datasets, including a newly designed human evaluation experiment, we demonstrate that DynaMiTE is a practical and efficient framework for helping users discover high-quality topic evolutions suited to their interests.
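The abstract names the three score families but not their exact forms, so the sketch below fills them in with common placeholder choices: cosine similarity over embeddings, a frequency lift for category indicativeness, and a temporal concentration ratio. The weights and the statistics dictionaries are likewise assumptions; only the "combine three criteria per term, per category, per time step" structure comes from the summary.

```python
import math

# Sketch of the score-ensembling idea. The three scoring functions below
# are placeholder assumptions (the paper defines its own); only the
# overall structure is taken from the summary above.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-9)

def score_term(term, category, t,
               term_vecs, cat_vecs,          # name -> embedding vector
               cat_freq, cat_total,          # term counts within the category's documents
               overall_freq, overall_total,  # corpus-wide term counts
               time_freq,                    # term -> {time step -> count}
               weights=(1.0, 1.0, 1.0)):
    # Semantic similarity to the user-provided category name.
    semantic = cosine(term_vecs[term], cat_vecs[category])
    # Category-indicative: how over-represented the term is in this category.
    category_ind = math.log(
        ((cat_freq.get(term, 0) + 1) / cat_total)
        / ((overall_freq.get(term, 0) + 1) / overall_total)
    )
    # Time-indicative: how concentrated the term's usage is at time step t.
    counts = time_freq.get(term, {})
    time_ind = counts.get(t, 0) / (sum(counts.values()) + 1e-9)
    w1, w2, w3 = weights
    return w1 * semantic + w2 * category_ind + w3 * time_ind

# Ranking candidates for one (category, time step) pair:
# top = sorted(cands, key=lambda w: score_term(w, "sports", 3, ...), reverse=True)[:10]
```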